Efficient In-memory Data Structures for n-grams Indexing
نویسندگان
چکیده
Indexing n-gram phrases from text has many practical applications. Plagiarism detection, comparison of DNA of sequence or spam detection. In this paper we describe several data structures like hash table or B+ tree that could store n-grams for searching. We perform tests that shows their advantages and disadvantages. One of neglected data structure for this purpose, ternary search tree, is deeply described and two performance improvements are
منابع مشابه
Space Efficient Data Structures for N-gram Retrieval
A significant problem in computer science is the management of large data strings and a great number of works dealing with the specific problem has been published in the scientific literature. In this article, we use a technique to store efficiently biological sequences, making use of data structures like suffix trees and inverted files and also employing techniques like n-grams, in order to im...
متن کاملIndexing DNA Sequences Using q-Grams
We have observed in recent years a growing interest in similarity search on large collections of biological sequences. Contributing to the interest, this paper presents a method for indexing the DNA sequences efficiently based on q-grams to facilitate similarity search in a DNA database and sidestep the need for linear scan of the entire database. Two level index – hash table and c-trees – are ...
متن کاملAlgorithms for speech indexing in microsoft recite
Microsoft Recite is a mobile application to store and retrieve spoken notes. Recite stores and matches n-grams of pattern class identifiers that are designed to be language neutral and handle a large number of out of vocabulary phrases. The query algorithm expects noise and fragmented matches and compensates for them with a heuristic ranking scheme. This contribution describes a class of indexi...
متن کاملEfficient In-Memory Indexing with Generalized Prefix Trees
Efficient data structures for in-memory indexing gain in importance due to (1) the exponentially increasing amount of data, (2) the growing main-memory capacity, and (3) the gap between main-memory and CPU speed. In consequence, there are high performance demands for in-memory data structures. Such index structures are used—with minor changes—as primary or secondary indices in almost every DBMS...
متن کاملEfficient Evaluation of Continuous Range Queries on Moving Objects
Abstract. In this paper we evaluate several in-memory algorithms for efficient and scalable processing of continuous range queries over collections of moving objects. Constant updates to the index are avoided by query indexing. No constraints are imposed on the speed or path of moving objects. We present a detailed analysis of a grid approach which shows the best results for both skewed and uni...
متن کامل